

COMPUTER ASSISTED LANGUAGE LEARNING, 2022, VOL. 35, NO. 4, 605-634

https://doi.org/10.1080/09588221.2020.1743323




Impact of automated writing evaluation on teacher feedback, student revision, and writing improvement

Stephanie Link (a), Mohaddeseh Mehrzad (b) and Mohammad Rahimi (b)

(a) English Department, Oklahoma State University, Stillwater, Oklahoma, USA; (b) Department of Foreign Languages and Linguistics, Shiraz University, Shiraz, Iran


ABSTRACT

Recent years have witnessed an increasing interest in the use of automated writing evaluation (AWE) in second language writing classrooms. This increase is partially due to the belief that AWE can assist teachers by allowing them to devote more feedback to higher-level (HL) writing skills, such as content and organization, while the technology addresses lower-level (LL) skills, such as grammar. As is speculated, student revisions will then be positively impacted. However, little evidence has supported these claims, calling into question the impact of AWE on teaching and learning. The current study explored these claims by comparing two second language writing classes that were assigned to either an AWE + teacher feedback condition or a teacher-only-feedback condition. Findings suggest that using AWE as a complement to teacher feedback did not have a significant impact on the amount of HL teacher feedback, but the teacher who did not use AWE tended to provide a greater amount of LL feedback than AWE alone. Furthermore, students seemed to revise the teacher's LL feedback more frequently than LL feedback from the computer. Interestingly, students retained their improvement in accuracy in the long term when they had access to AWE, but students who did not have access appeared to have lower retention. We explain the relevance of our findings in relation to an argument-based validation framework to align our work with state-of-the-art research in the field and contribute to a broader discussion about how AWE can best be provided to support second language writing development.

KEYWORDS

Automated writing evaluation; automated feedback; teacher feedback; student revision; long-term language gains


Introduction

Despite some suspicion surrounding the effectiveness of written corrective feedback (WCF) (Truscott, 1996), research has shown its contribution to students' accuracy improvement and writing development (Bitchener,


CONTACT Mohammad Rahimi rahimim@shirazu.ac.ir Department of Foreign Languages and Linguistics, Shiraz University, Shiraz, Iran

Supplemental data for this article can be accessed here.

© 2020 Informa UK Limited, trading as Taylor & Francis Group


2012; Ferris, 2010, 2011). In second language writing, as Weigle (2013) and Ferris (2010) speculate, corrective comments are necessary since gaps in students' linguistic repertoire can pose problems, especially in advanced writing. Thus, the overarching question in WCF research has shifted from whether to provide feedback to how best to provide it.

In search of efficient ways to comment on students' writing, researchers have tried to devise technologies via which the painstaking responsibility of writing evaluation is virtually delegated to (or more precisely, shared with) a non-human evaluator, i.e., a computer. Automated writing evaluation (AWE) is increasingly used as a learning affordance in classroom instruction (Chen & Cheng, 2008; Cotos, 2012; Grimes & Warschauer, 2010). Scholars have taken divergent stances toward machine evaluation; some advocates state that it liberates teachers by providing feedback on formal features of the texts and, thus, offers them more time to dedicate to the higher-level qualities, such as organization, argumentation, and development of the content (Chen & Cheng, 2008; Dikli & Bleyle, 2014; Stevenson, 2016; Ware, 2011; Weigle, 2013; Wang, 2015). Others maintain that it improves students' revision skills through the provision of instantaneous, recurrent, and individualized feedback and multiple opportunities for revision (Cotos, 2015; Stevenson & Phakiti, 2014).

Nonetheless, AWE has faced considerable skepticism. Much concern surrounds the negative washback of writing to a non-human audience that lacks human inferencing skills and, hence, can only address formal attributes of writing (Conference on College Composition & Communication, 2009, reaffirmed in 2014). AWE has also prompted levels of distrust because students seem to primarily attend to surface features without giving due attention to propositional content (Chen & Cheng, 2008; Li, Link, & Hegelheimer, 2015). As a result, AWE seems to promote revisions that are at word or sentence level and content remains intact with no substantial changes made to the meaning (Dikli & Bleyle, 2014; Warschauer & Grimes, 2008).

However, as with any other innovation, AWE can produce negative washback probably not because of some inherent weaknesses but because it is misused and not fully integrated into instruction (Cotos, 2012; Warschauer & Ware, 2006). Consequently, as Cotos (2014) advises, there should be a shift in research on AWE where "the ecology of implementation" (p. 59) is brought to the limelight, as the impact of AWE depends largely on how it is used (Warschauer, 2010). The debate, therefore, resonates with the one mentioned earlier regarding feedback in general: it is no longer the "if" of AWE integration but the "how" of it that matters. And in order to draw on the full potential of AWE, it is


frequently recommended that it be applied as a supplement (rather than replacement) to teacher feedback (Enright & Quinlan, 2010). Positive impact has been said to increase when an AWE tool reviews earlier drafts and teacher commentary is added on later drafts (Chen & Cheng, 2008; Dikli & Bleyle, 2014; Grimes & Warschauer, 2010; Ware, 2011; Warschauer, 2010; Weigle, 2013). The ideal hybrid is more likely the one in which AWE tools provide sentence-level feedback while teachers can attend to higher-order concerns (Weigle, 2013). On the whole, a combination of machine and teacher feedback is viewed to be optimal (Wang, 2015).

Ironically, the lingering problem in AWE research is that debates are mostly based on hypothetical claims that, though plausible, have not been fully supported empirically; research on AWE, particularly its use in L2 writing, is in its infancy (Cotos, 2012). Studies on the impact of AWE programs are inconclusive (Stevenson & Phakiti, 2014; Wang, 2015), and significant concerns are expressed regarding the possible positive or negative effects on key stakeholders, teaching, and learning (Cotos, 2012; Ware, 2011). More specifically, the impact of automated feedback on learning over an extended period of time is underexplored (Xi, 2010), and the big question of whether WCF has a lasting effect on students' writing improvement still deserves special research efforts (Liu & Brown, 2015). To this end, the present study takes into consideration developers' recommendations concerning the optimum way in which AWE should be deployed, that is, as a complement to teachers. After all, as Ware (2011) maintains, time is ripe for critically examining AWE's "potential use as a supplement to writing instruction, not a replacement" (p. 770). We conducted a comparative study weighing an AWE + teacher feedback condition against a teacher-only-feedback condition. Findings help to understand the impact of AWE use on teacher feedback, student revision, and short-/long-term gains in language ability. Our work contributes to the lasting question of whether and how AWE feedback can contribute to the L2 writing classroom, a concern that has long muddied the picture in the WCF and AWE fields.


Argument-based validation of AWE

In Cotos's (2012) view, AWE is still "a conundrum" (p. 84) because the findings of previous studies do not lend clear support to its effectiveness. Later, too, she reaffirmed that the evidence collected to evaluate AWE use has been relatively "inconsistent and, hence, insufficiently informative" (Cotos, 2015, p. 5). Owing to the inconclusiveness of research on WCF, and on AWE in particular, Bennett (2011) advised


that greater care is needed when evaluating evidence, and as he asserts, such meticulous evaluation is manageable through developing a validity argument. AWE researchers have called for the use of a validity argument framework for a more systematic appraisal of automated evaluation (e.g., Chapelle, Cotos, & Lee, 2015; Williamson, Xi, & Breyer, 2012). This validation approach helps integrate multiple pieces of validity evidence to argue for the validation of an assessment instrument (Kane, 2013; Enright & Quinlan, 2010).

In an attempt to delineate the argument-based validation of AWE,

Xi (2010) offered a series of questions that pertain to the accuracy, reliability, and meaning of automated feedback as well as the impact of automated feedback on student learning. Questions about impact include whether automated feedback leads to positive impact on teaching, improvements in learners' performance, and gains in targeted areas of language ability that are sustainable in the long term. These questions also drive much of the aforementioned controversies surrounding WCF and AWE and are thus what guide the current study.


Impact of AWE on teacher feedback

Research on the impact of AWE on teaching has addressed how teachers use and perceive automated feedback (Link, Dursun, Karakaya, & Hegelheimer, 2014). However, comparative research is needed to determine whether there are differences when AWE is used or not. To our knowledge, only three studies take this comparative angle (Dikli, 2010; Dikli & Bleyle, 2014; Wilson & Czik, 2016). Dikli (2010) and Dikli and Bleyle (2014) carried out comparative examinations of teacher feedback and automated feedback produced by two AWE tools, i.e., MyAccess! and Criterion, respectively. Both studies found large discrepancies between the two types of feedback and reached the same conclusion: teacher feedback excelled. While its automated counterpart was found to be lengthy, generic, and occasionally redundant or unusable, teacher feedback was more concise, focused, formative, and individualized. The instructors provided not only a larger amount of feedback but also more accurate or, as the authors put it, "better quality" feedback. Wilson and Czik (2016) compared the amount, type, and level of teacher feedback on middle school students' L1 English writings in two feedback conditions, i.e., a teacher-feedback-only condition and a combined teacher + automated feedback condition. Feedback type included the different manners in which feedback is presented (such as directive, query, informative, and praise), and feedback level referred to the specific writing skills and components that the feedback addresses. Feedback level is characterized


as a dichotomy of lower-level and higher-level subskills (or, roughly speaking, surface-level and content-level skills, respectively). Results of this study showed that teachers used the same amount of feedback in both conditions. However, the amount of teacher feedback on higher-level writing subskills in the AWE condition significantly exceeded that in the teacher-feedback-only condition. As the authors concluded, this finding provided partial support for the claim that, by assisting students in doing lower-level revisions, AWE frees up teachers to concentrate on higher-level aspects of writing. The current study extends Wilson and Czik's study by investigating the impact of automated feedback on teachers in L2 writing classrooms. In fact, L1 and L2 English students are characterized as having distinctive writing abilities and needs (Weigle, 2013a). L2 learners generally seem to welcome language corrections, and teachers might feel this urge to be responsive to the grammatical errors (Hyland & Hyland, 2019). Moreover, as Biber, Nekrasova, and Horn (2011) speculated, teachers are more authoritative figures in many non-western cultures than in western societies and, hence, it is likely that teacher feedback is more influential for L2 (or EFL) students than for L1 students. However, their meta-analysis indicated a reverse trend: L2 learners seem to be more receptive to other sources of feedback (including peer and computer feedback) than teacher feedback. This general indication demands further exploration (Biber et al., 2011; Liu & Brown, 2015). Thus, the language learning context appears to have an effect on teachers' feedback practices and, in this way, this research extension is deemed to be illuminating. We attempted to corroborate this by examining students' revision practices.


Impact of AWE on student revision practices

Only a few studies have explored AWE's potential in promoting students' redrafting or encouraging different types of revision. These studies examined the effect of AWE on students' revision practices by using three criteria: the number of essay submissions, acts of revision, and text modification. Attali (2004) and Li et al. (2015) used the number of submissions as an indicator of students' use of an AWE tool. Whereas Attali reported that the majority of the essays were submitted only once by 6th to 12th graders, Li et al. found AWE assisted redrafting among university ESL students. Although different learning contexts and objectives can in part explain contradictions in previous findings, as Li et al. suggested, the number of submissions alone may not provide an accurate assessment of revision practices.


Thus, acts of revision can also be informative. Chapelle et al. (2015) examined acts of revision of ESL university students after receiving feedback from Criterion. Their revision classification included six categories depending on presence and manner of revision: no change, remove, add, delete, change, and transpose. They found that although 50% of feedback resulted in no change in students' writing, 70% of the remaining revision acts were successful, providing some positive evidence that Criterion feedback is useful for students in making decisions on revision. While this information provides a valuable account of what decisions students make when using AWE feedback, little is known about what contributes to their decisions beyond mere accuracy of the automated feedback.

In an attempt to account for students' revision practices more descriptively, studies have investigated text modifications based on AWE feedback. For example, Cotos, Link, and Huffman (2017) analyzed native and nonnative students' text modifications based on feedback from a genre-based AWE tool. Their classification included modification to content, lexicon, grammar, structure, and mechanics. They, however, did not examine the relationship between feedback type and revision practice. In the present study, we attempt to gather insight into this association through an analysis of students' acts of revision and text modifications as they relate to the provision of teacher and AWE feedback.


Impact of AWE on writing improvement

Although evidence supports the positive potential of AWE feedback on writing development, results are inconclusive and thus insufficient in establishing its validity for use in L2 writing classrooms. The impact on accuracy, in particular, is unclear. Researchers have found that AWE feedback has a positive impact (Li et al., 2015; Li, Feng, & Saricaoglu, 2017; Wang, Shang, & Briody, 2013) or no impact (Ware, 2014) on accuracy. In their investigation of the impact of AWE feedback on overall writing accuracy of ESL university students, Li et al. (2015) used Criterion error reports (the number of errors identified and classified by Criterion) and found improvement in linguistic accuracy from first to final drafts. In a similar study, Li et al. (2017) found that the transfer of the short-term, positive effect on accuracy improvement to a long-term gain was only evident in one error category (out of nine categories), i.e., run-on sentences. In addition, Wang et al.'s (2013) quasi-experimental study found that after receiving assistance from the AWE tool CorrectEnglish, the experimental group outperformed those in the control group in terms of grammar, word usage, and spelling.


In contradiction, Ware (2014) found that accuracy (error rates) did not change from pretest to posttest across three feedback conditions: oral peer feedback on pen-and-paper drafts, teacher electronic feedback, and automated feedback from Criterion. Nevertheless, students did show improvement in their holistic scores, fluency scores, and genre elements scores; that is, human (peer or teacher) feedback had a significantly stronger impact on students' uptake of genre features than did the automated feedback. Further confounding findings across AWE studies, Wilson and Czik (2016) showed no statistically significant differences between conditions (i.e., a teacher-feedback-only condition and an AWE + teacher condition). Their analysis of students' final-draft writing quality after several drafts of the same paper included the AWE holistic score and trait-level scores (i.e., the score that indicates whether writing skills are Developing, Proficient, or Advanced in regard to word choice, conventions, and organization, development and style) from PEGWriting®, and a holistic quality score determined by the researchers.

Overall, the current research expands the available research base on WCF. So far, holistic scores have been the most popular measure of writing improvement. However, holistic scoring falls short of identifying the specific writing skills that have improved (Biber et al., 2011). In fact, Wilson and Czik (2016) advised using fine-grained linguistic measures that better capture changes in writing quality. Also, studies have mostly focused on accuracy measures, neglecting other fundamental dimensions of writing improvement, e.g., complexity and fluency (e.g., Van Beuningen, 2010). Finally, among the handful of WCF studies analyzed in Biber et al. (2011), no pretest-posttest designs and very few treatment-control designs operationalized WCF effectiveness in terms of the number/extent of revision. Therefore, revision quantity/quality is an underexplored domain in WCF research. Accordingly, in this study, students' improvement is assessed by using CAF (complexity, accuracy, and fluency) measures. Moreover, the majority of the current feedback studies have been either cross-sectional or measured effectiveness after a short interval (Karim & Nassaji, 2019), or else have confined CF effectiveness to impact on revised texts rather than on new pieces of writing (Liu & Brown, 2015). The research design should incorporate a delayed posttest with an interval longer than the one or two weeks typically used in previous research (Karim & Nassaji, 2019). Hence, this study attempts to respond to Wilson and Czik's call for more extended intervention designs that ensure learners' intensive exposure (i.e., making multiple revisions to multiple prompts), analyzing not only short-term transfer of learning but more sustained retention of it.


Research questions

According to recent syntheses of studies on WCF (Karim & Nassaji, 2019; Liu & Brown, 2015) and meta-analyses of the efficacy of WCF (Biber et al., 2011; Kang & Han, 2015; Van Beuningen, 2010), research in aggregate seems to indicate some effect of feedback. However, there has been a lot of variation in the findings, especially with respect to effect size (Kang & Han, 2015). This heterogeneity has been mostly attributed to research design flaws (Karim & Nassaji, 2019; Liu & Brown, 2015; Van Beuningen, 2010). On the whole, there seems to be a persistent limitation in previous research on feedback, and on AWE in particular; that is, most study designs (e.g., Attali, 2004; Warschauer & Grimes, 2008; Li et al., 2017) lacked a control group (Liu & Brown, 2015). As Link et al. (2014) maintained, research cannot provide a convincing argument for or against AWE when there is no control group. And when research did include a control group (as in Wang et al., 2013 or Wilson & Czik, 2016), no description of the nature of the teacher feedback in the control group was provided. According to Stevenson and Phakiti (2014), further research investigating the relative effects of AWE and teacher feedback is needed where a more in-depth explanation of the nature and quality of teacher feedback is presented, a point that is attended to in the design of the current research.

Thus, this study purports to find an answer to the following research questions using two conditions: an AWE + teacher feedback condition (herein called the AWE group) and a teacher-only-feedback condition (herein called the Teacher group). Informed by Xi's (2010) guiding questions, research question (RQ) 1 addresses the impact on teaching, RQ2 the impact on learners' revision performance/practice, and RQ3 the short- and long-term gains in language ability.

RQ1: Is there a difference in amount and level (higher- versus lower-level) of teacher feedback between the AWE and Teacher groups?

RQ2: Are there differences in student revision practices between the two groups?

RQ3: Are there differences in short- and long-term effect on the students' writing improvement in terms of CAF (complexity, accuracy, fluency) between the two groups?


Material and methods

This study was conducted in the department of Foreign Languages and Linguistics at Shiraz University in Iran. Prior to conducting this research, the ethical aspects of this study were approved by the Graduate Studies


Committee of the Department of Foreign Languages and Linguistics and the University Graduate Studies Committee at Shiraz University.


Setting and participants

Broadly speaking, product-based approaches characterize English writing instruction in our EFL context, and process- and genre-based practices or formative assessment tools are rarely employed. Also, teachers' feedback provision strategies are generally retrospective and put students' grammatical accuracy in the limelight. At Shiraz University, undergraduate students majoring in English are offered four 2-credit writing courses beginning from their third semester. Data for the study were collected from an essay writing course in the seventh semester. It focuses on text-based argumentative writing wherein the students are required to write academic arguments using ideas presented in a text on a controversial issue. Improving grammar and correct use of language constituted the secondary objective of the course as it mainly focused on Learning to Write, rather than Writing to Learn (see Manchón, 2011), though it partially looked at the Language Learning Potential of L2 Writing, too (Bitchener, 2012).

A total of 32 undergraduate English majors with an average age of 22 years enrolled in two sections of the essay writing course. The two sections were then randomly assigned to either the AWE group or the Teacher group. However, four students in the AWE group dropped the course for personal reasons, leaving a final sample of 12 students in the AWE group (6 males, 6 females) and 16 in the Teacher group (4 males, 12 females). The students in the AWE group had no previous experience with AWE.

At the beginning of the study, all students took the Quick Placement Test (QPT) (2001), which was developed by Oxford University Press and Cambridge ESOL. The results of a Mann-Whitney U test showed no significant difference between the groups, which suggests that the two groups started with similar proficiency levels: AWE group (Mdn = 43), Teacher group (Mdn = 45), U = 124, p > 0.05.
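As a rough illustration of the baseline comparison reported above (not the study's actual data or scripts), the sketch below runs a Mann-Whitney U test on two sets of placement scores; all score values are hypothetical placeholders.

```python
# Minimal sketch of the baseline proficiency check: a Mann-Whitney U test on QPT scores.
# The score lists are hypothetical placeholders, not the study's data.
import numpy as np
from scipy.stats import mannwhitneyu

awe_qpt = [40, 42, 43, 44, 41, 45, 43, 46, 42, 44, 43, 45]                      # n = 12
teacher_qpt = [44, 45, 46, 43, 47, 45, 44, 46, 45, 43, 46, 44, 45, 47, 44, 42]  # n = 16

u_stat, p_value = mannwhitneyu(awe_qpt, teacher_qpt, alternative="two-sided")
print(f"Mdn(AWE) = {np.median(awe_qpt)}, Mdn(Teacher) = {np.median(teacher_qpt)}")
print(f"U = {u_stat}, p = {p_value:.3f}")  # p > .05 suggests comparable starting proficiency
```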

Both classes were taught by the third author (a professor of language and linguistics at the university) and two teaching assistants. The assistant for the AWE group was a PhD candidate with nearly 7 years of experience in teaching L2 writing. The assistant in the Teacher group was a Master's student who had taught L2 writing for about 5 years. The students in both groups were informed of the nature and purpose of the study and signed a participation consent form, which indicated that their identities would remain confidential through coding of their personal information and that they would have the right to withdraw from participating in the research at any time during the experiment. However, since the research procedure was part of the natural writing class, withdrawal just meant announcing that they would not like their data to be used for research purposes. Given that the third author was the teacher of both classes and the students might have been worried that withdrawing from the research would affect their teacher's evaluation of their performance, they were allowed to inform the teacher of their withdrawal from the research after the course grades had been announced.


Description of the AWE tool

Criterion is a web-based, teacher-led formative writing assessment tool that provides instant diagnostic feedback on students' essays. Developers emphasize that it is not meant to replace teacher instruction and feedback, as it falls short of evaluating essay content and style (ETS, 2019). Teachers can embed feedback using the teacher comment feature. Teachers also have access to Criterion's library of over 180 essay topics of different genres, including narrative, expository, and persuasive, or they can use their own prompts. Criterion trait feedback analysis addresses five categories of errors in student essays, namely grammar, usage, mechanics, style, and organization and development. The feedback is unfocused and usually indirect; however, sometimes suggestions of the correct form or brief metalinguistic explanations are also provided. Supplementary features include a Writer's Handbook and sample essays. Criterion also allows multiple revisions of an essay and provides evaluation reports that track students' progress over time. In addition to trait-specific feedback, a holistic score (on a 1-6 scale) and a trait-level graph (including three levels: developing, proficient, and advanced) are provided with brief descriptions of each level.


Pre-, post-, and delayed post-tests

On three separate occasions, students were asked to complete a 105-minute argumentative writing task where they needed to take a side on a technology-related issue and provide reasons and examples for their position. Prompts were from Criterion's topic library (Appendix A, Supplementary material), but the pre- and post-test prompts between groups were slightly different in order to avoid cheating. All prompts were related to education or technology as these were familiar subject areas to these students. Students were also given a text on the same topic to generate ideas for their arguments and counterarguments.


Teacher training

The AWE Group assistant provided only higher-level comments on content and organization for the AWE group condition. The Teacher Group assistant gave both higher- and lower-level feedback in the Teacher group. Regarding the higher-level feedback, teachers in both groups were encouraged to do what they would normally do in an essay writing course. More precisely, they were asked to consider their students' needs and the course objectives when providing such feedback.

For lower-level feedback, the main instructor provided training based on a guide adopted from Ferris and Hedgcock (2005). Abbreviated error codes adapted from Ferris and Roberts (2001) and Dikli and Bleyle (2014) facilitated commenting. Training consisted of independent reviewing, commenting, and scoring of student texts as well as group discussions about discrepancies. In both treatments, the feedback on accuracy was comprehensive (rather than focused) and largely indirect. While research has argued for a more focused/selective approach to the treatment of linguistic errors (e.g., Lee, 2011, 2019), it is worth mentioning that unfocused, or comprehensive, feedback practice has been recommended for future research since it is arguably a more authentic classroom strategy and adds to the ecological validity of the method (Ferris, 2012; Liu & Brown, 2015; Van Beuningen, 2010). A recent study on focused versus comprehensive feedback (Rahimi, 2019) revealed that although the former is more effective for the improvement of the target structures, the latter is more influential in realizing a more authentic classroom objective, i.e., improving overall written accuracy (Van Beuningen, 2010).


Study procedures

This study spanned a 16-week semester with one two-hour session per week. In week 1, students took the QPT. Weeks 2 to 5 were devoted to introducing the principles of argumentative writing. Starting from week 6, students completed the pre-test, which they later revised and submitted for a grade. After submitting each draft, the AWE group received automated feedback immediately and content feedback from the teacher within two days. Students in the Teacher group received the teacher's linguistic and content feedback, also within two days.

Students took the post-test at the end of the 16 weeks and then the delayed post-test three months after the semester ended.1 They were again given a source text and the prompt. Also, all students wrote the delayed essay on paper. This allowed us to examine the more generalized improvements in writing proficiency and the transfer of the effects of


AWE to texts for which no AWE feedback has been provided, as suggested by Stevenson and Phakiti (2014).


Study measures and analysis

RQ1: Impact on teacher feedback

Impact of AWE on teaching was determined by examining the amount and level of teachers' higher- versus lower-level feedback between groups. First, all AWE and teacher feedback on students' pretests were extracted and parsed into constituent idea units; an idea (or feedback) unit has been described as a comment that targets a single feature in the text (Wilson & Czik, 2016). Next, these self-contained comment segments were assigned a level and skill based on Wilson and Czik's (2016) coding scheme for feedback.


Lower-level (LL) writing skills (7 categories): spelling, capitalization, punctuation, sentence structure, grammar, formatting, and word choice2


To verify the reliability of the coding, the first and the second authors independently coded a random 10% of feedback units (n = 47). Discrepancies between codes were discussed to add credibility to the coding process. A second round of independent coding was employed using a new randomized set of 10% of the data; percent agreement for level and skills was 94.44% and 91.67%, respectively. Discrepancies were again discussed, and the second author coded the remainder of the data.
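A minimal sketch of the percent-agreement check described above, assuming each coder's labels for the double-coded feedback units are stored in parallel lists; the labels shown are illustrative placeholders only.

```python
# Minimal sketch of inter-coder percent agreement on feedback-level codes.
# The label lists are hypothetical placeholders, not the study's coding data.
def percent_agreement(coder_a, coder_b):
    """Percentage of units that received the same code from both coders."""
    assert len(coder_a) == len(coder_b)
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100 * matches / len(coder_a)

coder_1 = ["HL", "LL", "LL", "HL", "LL", "LL", "HL", "LL", "LL", "LL"]
coder_2 = ["HL", "LL", "LL", "HL", "LL", "HL", "HL", "LL", "LL", "LL"]
print(f"Level agreement: {percent_agreement(coder_1, coder_2):.2f}%")  # 90.00%
```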


RQ2: Impact on student revision practice

Impact on student revision practices was determined by examining acts of revision and text modification. Revisions were first parsed using the procedures described for RQ1. To examine students' revision acts, we utilized a framework described in Chapelle et al. (2015), which was first established in Ferris (1997) and modified by Lee and Hegelheimer (2012) to assess revisions based on AWE feedback in research similar to the present study (see Table 1).

To analyze text modifications, Wilson and Czik's (2016) coding framework, described in RQ1, was modified and refined to fit the context of the study. Specifically, the component "self" in the framework was omitted since it was described as feedback that pertains to commentaries about the author's performance, process, effort, or experience and was, therefore, not relevant for the analysis of revision.


Table 1. Coding scheme (with definitions and examples) for analyzing revision acts.

No change
  Definition (a): Students did not make a change attempting to correct the identified error.
  First draft: Based on the research, around two third of people worldwide prefer to send text messages rather than calling or sending emails.
  Feedback unit: What research? (Teacher feedback)
  Last draft: Based on research around two third of people worldwide prefer to send text messages and emails, rather than to call or meet in person.

Remove
  Definition (a): The part of an essay including a highlighted error was removed in the process of changing content and organization.
  First draft: As the name suggest IM doesnt take time to be delivered so it no longer holds the form of a letter,
  Feedback unit: This subject and verb may not agree. Proofread the sentence to make sure the subject agrees with the verb. (AWE feedback)
  Last draft: Not found in the second draft.

Add
  Definition (a): Students revised the essay by adding word(s), phrase(s), or sentence(s).
  First draft: In contrast to the idea that texting is destroying literacy and English language, using text-speak, or language specific to text messages,
  Feedback unit: You may need to use an article before this word. (AWE feedback)
  Last draft: In contrast to the idea that texting is destroying literacy and the English language, using text-speak, or language specific to text messages,

Delete
  Definition (a): Students revised the essay by deleting word(s), phrase(s), or sentence(s).
  First draft: you have to encounter with the situation and solve your problem.
  Feedback unit: Prep (Teacher feedback)
  Last draft: you have to encounter the situation and solve your problem.

Change
  Definition (a): Students revised the essay by changing word(s), phrase(s), or sentence(s) with alternatives.
  First draft: The use of high-tech visual and physical aids has made life more tolerable for the handicapped or the blind.
  Feedback unit: This word is sort of offensive. (Teacher feedback)
  Last draft: The use of high-tech visual and physical aids has made life more tolerable for the disabled or the blind.

Transpose
  Definition (a): Students revised the essay by changing the order of words, phrase(s), or sentence(s).
  First draft: I can see how the old school correspondence naturally will be replaced by E-mail and instant messages and there will come a time where paper letters will be obsolete and nothing more than a distant memory. So why not start introducing our young to the mechanism and show them how its done properly. [ ]
  Feedback unit: Your argument in this paragraph is irrelevant. You should focus on why or why not teenagers should be encouraged to use these and if they would improve their reading and writing skills.
  Last draft: Moved paragraph from the first body paragraph in the first draft to the last body paragraph in the last draft.

(a) Definitions adopted from Chapelle et al. (2015) and Lee and Hegelheimer (2012).

Inter-coder reliability was again conducted as described previously. Final agreement was 92.31% and 97.44% for revision acts and text modification, respectively. The success of these revisions, based on Cunningham (2019), was then analyzed to determine whether students' revision practices resulted in successful or unsuccessful revision acts and text modifications.


RQ3: Short- and long-term impact on learning and learners' performance

Students' pre-, post-, and delayed post-tests were assessed by using CAF measures to determine writing improvement. Syntactic and lexical complexity were investigated using the L2 Syntactic Complexity Analyzer (L2SCA) (Lu, 2010). Syntactic complexity has long been conceived as a multi-componential construct, the assessment of which should incorporate measures of subordination, coordination, and phrasal complexity (Housen, Kuiken, & Vedder, 2012; Johnson, 2017; Norris & Ortega, 2009). For the purpose of this study, four syntactic complexity indices were chosen based on previous research (Chen, Xu, & He, 2014; Lu, 2011, 2017). They included mean length of clause (MLC), dependent clause per T-unit (DC/T), coordinate phrases per clause (CP/C), and complex nominal per clause (CN/C).

Recently, the inclusion of judiciously selected metrics of lexical diversity and sophistication has been underscored in SLA studies (e.g., Johnson, 2017) because the conjectures about the potential tradeoffs between lexical and syntactic complexity, and also between the two lexical subcategories (Verspoor, Schmid, & Xu, 2012; Yoon & Polio, 2016), can be authenticated. With the aid of the computational tool Coh-Metrix 3 (McNamara, Graesser, McCarthy, & Cai, 2014), we computed one metric of lexical diversity, i.e., the measure of textual lexical diversity (MTLD), and two indices of lexical sophistication, i.e., average word length (WL) and word frequency (WF). Each is considered a valid measure of L2 proficiency as described previously (Yoon & Polio, 2016).

Based on the work of Polio and Shea (2014), we selected error-free clauses (EFC) as the accuracy measure in this study. One measure was chosen to avoid redundancy since accuracy measures are all related to the number of errors in one way or another (Polio & Shea, 2014). Polio's (1997) guidelines for coding clauses and errors were adopted in this study. The inter-coder reliability was 0.95.

Fluency is usually assessed using measures of amount of production. The most common fluency metric is total word count with constant composing time (Plakans, Gebril, & Bilki, 2019; Johnson, Mercado, &

Acevedo, 2012). The number of structural units such as clauses has also been used to capture the amount of discourse a student can produce in a set time frame (Polio, 2001). Consequently, average number of words and clauses per text were the two metrics used in our study.


Statistical analysis

Chi-square tests were conducted on the frequency of HL and LL feedback to examine group differences for RQ1 and RQ2. Bonferroni post hoc tests were performed to determine the location of statistically significant differences. Due to having an unequal and small sample size, we used the Friedman test, which is the nonparametric alternative to a one-way ANOVA with repeated measures, to examine group differences for RQ3. In these analyses, we used CAF measures as dependent variables to compare improvement across pre-, post-, and delayed post-tests. Effect sizes for the Friedman test are reported as r, where r = Z/√N, and interpreted based on the Plonsky and Oswald (2014) benchmarks for understanding effect sizes in L2 research.
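As a rough sketch of this analysis pipeline (not the study's actual data or scripts), the snippet below runs a Friedman test on one CAF measure across the three test occasions and then converts a Wilcoxon signed-rank follow-up into the effect size r = Z/√N; all scores are hypothetical placeholders.

```python
# Minimal sketch of the RQ3 analysis: Friedman test across the three test occasions,
# then a Wilcoxon signed-rank follow-up with effect size r = Z / sqrt(N).
# All scores below are hypothetical placeholders, not the study's data.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon, norm

# One CAF measure (e.g., error-free clauses) for 12 learners at three time points
pretest  = np.array([4, 5, 3, 6, 4, 5, 4, 3, 5, 6, 4, 5])
posttest = np.array([6, 7, 5, 8, 6, 7, 6, 5, 7, 8, 6, 7])
delayed  = np.array([6, 6, 5, 7, 6, 7, 5, 5, 7, 7, 6, 6])

chi2, p = friedmanchisquare(pretest, posttest, delayed)
print(f"Friedman: chi2 = {chi2:.2f}, p = {p:.4f}")

# Post hoc Wilcoxon signed-rank test (pretest vs. delayed posttest), recovering |Z|
# from the two-sided p-value via the normal quantile, then r = Z / sqrt(N).
stat, p_w = wilcoxon(pretest, delayed)
z = norm.isf(p_w / 2)            # |Z| implied by the two-sided p-value
r = z / np.sqrt(len(pretest))    # effect size r = Z / sqrt(N)
print(f"Wilcoxon: p = {p_w:.4f}, |Z| = {z:.2f}, r = {r:.2f}")
```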


Results

RQ1: Differences in teacher feedback

To determine whether there was an overall difference in total feedback units provided between feedback conditions, normalized frequencies were calculated based on the number of feedback units per 1000 words. Interestingly, the use versus nonuse of Criterion resulted in distinct differences in the total frequency of feedback. The teacher in the Teacher Group provided 22.60 feedback units per 1000 words, whereas in the AWE Group the teacher provided 10.15 feedback units and Criterion provided 11.70 feedback units (a total of 22.72 feedback units) per 1000 words. With potential teacher variables taken into consideration, it is not surprising that the teacher provided nearly half the amount of feedback when the AWE tool was available. Research has suggested, however, that the complementary role of AWE tools can enable teachers to provide more higher-level feedback (Weigle, 2013); thus, we examined possible differences in the frequency distribution of HL versus LL feedback between the two groups.
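A minimal sketch of the per-1000-word normalization used for the comparison above; the raw counts and word totals are hypothetical placeholders, not the study's figures.

```python
# Minimal sketch of normalizing raw feedback counts to units per 1000 words.
# The counts and word totals below are hypothetical placeholders.
def per_thousand_words(feedback_units: int, total_words: int) -> float:
    return feedback_units / total_words * 1000

print(per_thousand_words(feedback_units=226, total_words=10000))  # teacher, Teacher group
print(per_thousand_words(feedback_units=102, total_words=10050))  # teacher, AWE group
print(per_thousand_words(feedback_units=118, total_words=10050))  # Criterion, AWE group
```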

A chi-square test of independence showed that the use of AWE had a significant influence on the level of feedback provided, χ2(1, N = 393) = 4.26, p < .05, φ = .104 (Figure 1). Interestingly, teachers in both groups seemed to provide a similar frequency of HL feedback with and without the use of AWE. In regard to LL feedback, students in the AWE group tended to receive less LL feedback from Criterion than the students received in the Teacher group from their teacher, which seems to indicate that the AWE tool was not providing the quantity of LL feedback that a teacher would otherwise provide if the AWE tool was not used.


Figure 1. Frequency of higher-level versus lower-level feedback on students' writing by group. Higher-level feedback is from the teacher in both groups. Lower-level feedback is from the teacher in the Teacher group and from Criterion in the AWE group.

For the feedback category, chi-square results showed that the use of AWE did not seem to have an impact on what HL feedback was provided (χ2(5, N = 168) = 2.58, p > .05), suggesting that the teachers were not more likely to provide a specific category of HL feedback due to the use or non-use of AWE. However, there was a statistically significant relationship between group and the category of LL feedback provided, χ2(5, N = 231) = 49.46, p < .0001, φ = .463. A Bonferroni post hoc test was performed to determine what variables contribute to the significance. Criterion in the AWE group attended to an array of feedback categories, with a significantly higher proportion of punctuation errors (N = 42) provided to students compared to the Teacher group (N = 23), p = .0001. The teacher in the Teacher group also attended to an array of feedback categories, with a significantly higher frequency of grammar and word choice feedback (p = .001), which is likely due to the provision of teacher feedback being more humanistic, resulting in a higher sensitivity to salient errors.
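As an illustration of the kind of group-by-level comparison reported above (not the study's actual contingency data), the sketch below runs a chi-square test of independence on a 2 x 2 table and derives the phi coefficient; all counts are hypothetical placeholders.

```python
# Minimal sketch of a chi-square test of independence for feedback level by group,
# with a phi effect size. The contingency counts are hypothetical placeholders.
import numpy as np
from scipy.stats import chi2_contingency

#                    HL    LL
table = np.array([[  85,  110],   # Teacher group (hypothetical counts)
                  [  83,  115]])  # AWE group (hypothetical counts)

chi2, p, dof, expected = chi2_contingency(table, correction=False)
n = table.sum()
phi = np.sqrt(chi2 / n)  # phi coefficient for a 2 x 2 table
print(f"chi2({dof}, N = {n}) = {chi2:.2f}, p = {p:.3f}, phi = {phi:.3f}")
```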


RQ2: Differences in student revision practices

It is interesting to note that for both groups, students seemed to respond to teacher feedback similarly (see Figure 2). In both groups about 12% of feedback from the teacher resulted in no change, and notably in the AWE group, about 24% of feedback from Criterion resulted in no change. Furthermore, 3% and 9% of feedback from the teacher in the AWE and Teacher groups, respectively, resulted in removal of the error, whereas 21% of Criterion feedback resulted in removal. These results may be evidence to suggest that students were avoiding or doing away with a problem (Victori, 1995) rather than attempting a successful revision based on the AWE feedback. This strategy is commonly referred to as reduction or a "play it safe" mentality (Hulstijn & Marchena, 1989, p. 241). More cognitive research or perception-based studies should be conducted to understand students' revision processes and reasons for dismissing AWE feedback more frequently than teacher feedback.


Figure 2. Percentage of student revision acts based on group and feedback source.

Figure 3 shows that of the remaining revision acts (add, delete, change, transpose), the success of overall revisions was relatively similar across both groups: Teacher Group (84%) and AWE Group (86%). Of the 16 feedback units unsuccessfully addressed in the AWE Group, only four were due to AWE feedback. Although the success of revision is worthy of further investigation, we observed that some unsuccessful revision may be partially due to the indirect nature of the AWE feedback and perhaps the students' ability to accurately apply linguistic rules, as in the following example where Criterion highlighted the whole sentence in the student's first draft and provided indirect feedback about punctuation. The student added a comma incorrectly.

Example 1:

First Draft: So to not catch up to the pace of that progression would be a gear not grinding in the complex machine of advancement.

AWE feedback: Review the rules for using punctuation marks.


Revision: So to not catch up to the pace of that progression, would be a gear not grinding in the complex machine of advancement.


Figure 3. Percentage of successful and unsuccessful revision acts per group.

Future research should consider using interviews, think-alouds, or stimulated recalls to fully capture the reasons behind students' revision acts. A closer examination of the accuracy of AWE feedback should also be taken into account since previous research has shown that the success of students' revisions is often linked to the accuracy of the feedback engine (Chapelle et al., 2015).

To further explore the nature of students' revision practices, a chi-square test was again performed to determine differences in students' text modifications between groups (Figure 4). A statistically significant difference emerged between groups in terms of both level (χ2(1, N = 262) = 13.44, p < .0001, φ = .226) and category (χ2(1, N = 262) = 29.64, p = .001, φ = .336) of revision. As expected based on feedback provision, there was a high frequency of revision on ideas and elaboration. Students in the Teacher group seemed to revise this category more frequently even though the teacher in both groups tended to comment evenly on this category. For LL revisions, a Bonferroni post hoc test was again performed to determine significance. Interestingly, the AWE group revised the grammar category less frequently (N = 12) than the Teacher group (N = 59), p = .033. The Teacher group (N = 16) also revised a higher proportion of punctuation errors compared to the AWE group (N = 13), p = .006.


Figure 4. Frequency of higher-level versus lower-level text modifications by group.


Through exploration of whether these revisions were successful, we again found similarities between both groups. The Teacher Group successfully revised 87% and the AWE Group 90% of the total feedback provided. Figure 5 provides a closer look at LL feedback, which was the type of feedback Criterion provides. Results show that the AWE tool provided a high amount of punctuation feedback, but students unsuccessfully addressed 8% of it. As Example 1 above illustrated, this could be due to the indirect nature of the feedback (among other possibilities that should be examined further). However, the indirectness of the feedback may not be the only concern. In Example 2 below, Criterion underlines the word "every" and then provides two-fold feedback. First, it directs the student to the determiner, which may be misleading, but then suggests to the student that the determiner should agree with the word it modifies, which seems appropriate. However, the student shows evidence of fixating on the determiner rather than the noun that follows.

Example 2:

First Draft: Although It is obvious textese play an important role in every fields of study

AWE feedback: You may have used the wrong determiner. Proofread the sentence to make sure that the determiner agrees with the word it modifies.

Revision: Although this is obvious text-speak play an important role in nowadays fields of study

Interestingly, the AWE feedback was not altogether inaccurate, but it is likely that the student was either not able to parse the feedback or unaware that the feedback should be examined more thoroughly. Nevertheless, students' text modifications were overall relatively appropriate.


Figure 5. Percentage of successful and unsuccessful text modifications per group.


RQ3: Differences in short- and long-term writing improvement

A series of Friedman tests of significance were carried out to determine possible differences within the two conditions from pretest to posttest to delayed posttest in regard to CAF measures as indicators of short- and long-term improvement. Unlike Ware's (2014) study, students in the present study showed indications of accuracy improvement. For the AWE group, the only significant difference across pretest, posttest, and delayed posttest was for accuracy (see Table 2). The Teacher group, on the other hand, showed improvement in not only accuracy but 8 of the 9 CAF measures.

To determine whether the significant findings from the Friedman tests were maintained in the delayed posttests, post hoc analyses using Wilcoxon signed-rank tests were performed with a Bonferroni correction applied (significance at p < .003). Table 3 shows the results of the analyses. It is worth noting that the AWE group retained their improvement of accuracy (EFC) from pretest to delayed posttest (Z = -3.059, p = .002, r = .883). However, the Teacher group seemed to not maintain their level of improvement (Z = -2.896, p = .004, r = .724). For other measures within the Teacher group, students were able to maintain their improvement in syntactic complexity and one measure of fluency, i.e., average words.


Table 2. Friedman test results for differences between complexity, accuracy, and fluency measures of pretest-posttest-delayed posttests.

                        AWE Group (df = 2, N = 12)       Teacher Group (df = 2, N = 16)
                        χ2        p       Kendall's W    χ2        p       Kendall's W
Syntactic complexity
  MLC                   2.000     .368    .083           30.125    .000    .941
  DC/T                  3.500     .174    .146           32.000    .000    1.000
  CN/C                  4.667     .097    .194           32.000    .000    1.000
Lexical complexity
  MTLD                  2.167     .338    .090           8.375     .015    .262
  WL                    2.000     .368    .083           20.667    .000    .646
  WF                    3.872     .144    .161           .387      .824    .012
Accuracy
  EFC                   17.167    .000    .715           11.375    .003    .355
Fluency
  Average words         .130      .937    .005           13.000    .002    .406
  Clauses per text      .809      .667    .034           7.625     .022    .238

Note. Syntactic complexity measures: mean length of clause (MLC), dependent clause per T-unit (DC/T), complex nominal per clause (CN/C). Lexical complexity measures: measure of textual lexical diversity (MTLD), average word length (WL), word frequency (WF). Accuracy: error-free clauses (EFC). Fluency: average number of words, clauses per text.


Table 3. Wilcoxon signed-rank tests for differences between pretest-posttest and pretest-delayed posttest. (a)

                        Pretest-posttest                 Pretest-delayed posttest
                        Z         p (b)   r              Z         p       r
AWE Group
 Accuracy
  EFC                   -2.275    0.023   .657           -3.059    0.002   .883
Teacher Group
 Syntactic complexity
  MLC                   -3.516    .000    .879           -3.516    .000    .879
  DC/T                  -3.516    .000    .879           -3.516    .000    .879
  CN/C                  -3.516    .000    .879           -3.516    .000    .879
 Lexical complexity
  MTLD                  -2.844    .004    .711           -1.500    .134    .375
  WL                    -3.518    .000    .880           -2.457    .014    .614
 Accuracy
  EFC                   -2.947    .000    .879           -2.896    .004    .724
 Fluency
  Average words         -2.956    .003    .737           -3.104    .002    .776
  Clauses per text      -2.327    .003    .739           -2.068    .039    .517

(a) Measures chosen based on significant results of previous Friedman tests.
(b) p < .003 is significant based on the Bonferroni correction.


General discussion

Despite an increased interest in the use of AWE, research in the L2 context has been limited (Cotos, 2012) and results have been inconclusive (Stevenson & Phakiti, 2014; Wang, 2015), especially in relation to long-term impact (Xi, 2010) and in comparison to non-use of AWE (Wilson & Czik, 2016). In many ways, our research begins to fill these gaps through a comparative investigation of use and non-use of Criterion in second language writing classrooms. To add further to current AWE research, we utilized a unitary approach to gathering research evidence (e.g., Xi, 2010; Chapelle et al., 2015) by discussing our evidence through the lens of argument-based validation and the inference of impact.


Argument-based appraisal of AWE

There are three underlying assumptions that the current study addressed concerning the impact of AWE: (1) AWE + teacher feedback has a positive impact on the amount and level of teacher feedback provided to students, (2) AWE + teacher feedback has a positive impact on the amount and level of student revision, and (3) AWE + teacher feedback has a positive short-term impact on students' writing in terms of complexity, accuracy, and fluency, and the improvement is sustainable in the long term.


Assumption 1: Positive impact on teacher feedback

Findings seemed to provide only partial support in favor of AWE use for positively impacting teacher feedback. In terms of amount, the teacher in the AWE Group tended to provide less overall feedback compared to the teacher in the Teacher Group. The reasons for this finding were underexplored in this study, but one explanation may be that Criterion provided complementary feedback that enabled the teacher to comment less. Individual teacher differences could have confounded these results, so further research is needed to explore this possibility.

For level of feedback, the use of Criterion did not result in a higher frequency of HL teacher feedback, as research has speculated (Chen & Cheng, 2008; Dikli & Bleyle, 2014; Stevenson, 2016; Ware, 2011; Weigle, 2013; Wang, 2015). Teachers in both groups seemed to provide a similar frequency of HL feedback. Thus, the teacher in the AWE Group did not appear to capitalize on the presumed saved time to amplify higher-level feedback; yet, seen in the broader context, the finding denotes that teaching practices (including commenting) vary with course objectives. In fact, here, the Learning to Write (LW) orientation might have molded feedback practices in both conditions. Trying to be responsive to the ultimate course goal, teachers in both groups, including the one in the Teacher Group, might have attempted to address the content-based errors in a comprehensive manner, which can in part explain the approximately equal amount of higher-order feedback across the two groups. That is why Weigle (2013) believes making a distinction between LW and WL (Writing to Learn) approaches is central to AWE validation endeavors. Future research on teacher cognition could provide deeper insight into how, if at all, the use of AWE impacts teachers' feedback practices.

Importantly, the teacher in the Teacher Group provided more LL comments than Criterion alone, which may be due to a computer's inability to detect errors as comprehensively as a human can. In other words, the breadth of error detection seems limited (see Leacock, Chodorow, Gamon, & Tetreault, 2010 for more information). Future research should investigate the impact of teacher feedback when LL feedback is provided with and without AWE. It is also critical to acknowledge that while this study explored the provision of comprehensive feedback, this decision may have contributed to the findings of this study, and research on selective feedback strategies could shed light on how teachers can integrate AWE tools in the L2 writing classroom.

Taken together, results from the teacher feedback data may have been due to a number of additional factors, such as the assessment criteria or the actual number of higher-order errors in each group, which can cause a ceiling effect in feedback provision. Finally, since two different teachers were feedback providers in the two groups, their actual abilities in providing higher-level feedback could have affected the amount of this feedback. Although the teachers' similar teaching experience plus the teacher-training sessions conducted before the experiment may control for this extraneous variable, future explorations of this issue are recommended to address this limitation more rigorously, probably by having the same teacher provide feedback in both classes.


Assumption 2: Positive impact on student revision

Findings seem to provide some support for use of AWE to positively impact students' revision practices. Students appeared to attend to both HL and LL feedback from both the teacher and Criterion. Thus, the tool did not confine their revision attempts to surface-level language features. Students in both groups showed high frequency of revision on ideas and elaboration, which seems proportionate to the amount and type of teacher feedback provided. Even though a thorough analysis of revision quality was beyond this study, findings seem to suggest that students were successful at addressing both teacher and AWE feedback; follow-up studies should consider how to best control for variables potentially impacting revision decisions, including feedback type, directness, and quality.

Also noteworthy is that 45% of AWE feedback units were dismissed during the revision process, which aligns with Chapelle et al. (2015).

Certainty as to why this result occurred warrants further investigation; however, studies have suggested that inaccuracies in feedback can affect students' willingness or ability to use it (Ranalli, 2018). We suggest that difficulty in understanding feedback can provide an alternative explanation, especially when AWE feedback has multiple layers. In fact, further perception data or data from think-aloud protocols could help to verify this potential rebuttal and ascertain the possible underlying reasons why students may dismiss the feedback. In addition, more research beyond revision practices is needed to determine the efficiency of the revision, the effect on writing accuracy improvement, and the lasting uptake. It is also still unclear whether AWE use indeed improves revision skills, as previous research has suggested (Stevenson & Phakiti, 2014).


Assumption 3: Positive short- and long-term impact

Promising evidence in support of AWE was obtained from the accuracy data. Both groups displayed written accuracy improvement in the short term. Over time (after a three-month interval), however, the AWE group seemed to surpass their peers in the amount of retention. In particular, the sustained accuracy improvement over time validates the potential for students to internalize the knowledge they gain from AWE CF and store it in their long-term memory for later retrieval (Bitchener, 2012). This internalization is likely due to the constant nature of AWE feedback, in that students can continuously seek out this feedback, whereas teacher feedback is not inherently available on demand. Yet the Teacher Group did improve in more CAF measures (8 of 9) than the AWE Group. Since language issues in the AWE Group were addressed only by the automated system, we can argue that the machine seems to fall short of teacher feedback in capturing features that can potentially affect written complexity or fluency. In other words, teacher feedback may stimulate broader development of the writing construct than automated feedback. It is also likely that the technology, and the multiple revision opportunities it offers, primes students to focus on accuracy at the expense of other features. In fact, second language writing research has indicated that each level of L2 writing development is characterized by improvement or deterioration in certain aspects of the CAF triad (Johnson, 2017; Ortega, 2015). Therefore, incorporating a judiciously selected range of metrics from all three facets in the current study seemed to better capture the nuanced changes in development. Future AWE research should consider utilizing similar objective measures of the writing construct so that comparisons between studies can begin to take place and the value of AWE feedback can be better revealed (a minimal illustration of such measures appears below). What is missing in this investigation, though, is a criterion (or a set of criteria) that helps appraise constructs that are a major concern to SLA researchers, such as genre knowledge, structure, and idea development. One possibility is the inclusion of functionally oriented measurements that encompass the discourse-pragmatic dimensions of language (Ortega, 2015) and/or the inclusion of these constructs in learning objectives, and thus assessment rubrics, for more seamless integration into classroom teaching and research. With this said, prompt and teacher effects also need to be considered. Future research may focus on using more equivalent prompts and the same teacher across conditions.
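To make the notion of objective measures of the writing construct concrete, the minimal Python sketch below computes three toy CAF-style indices: total words as a fluency proxy, mean sentence length as a complexity proxy, and the proportion of error-free sentences as an accuracy proxy. It is offered only as an illustration under our own assumptions; the function names and the simple sentence splitter are not the instruments used in this study, and established indices such as those of Lu (2010, 2011) or Polio and Shea (2014) would be preferable in actual research.

```python
# Illustrative sketch only: these are generic CAF-style proxies, not the
# measures reported in this article. All names here are our own assumptions.
import re
from dataclasses import dataclass


@dataclass
class CAFProfile:
    total_words: int              # fluency proxy: amount of text produced
    mean_sentence_length: float   # complexity proxy: words per sentence
    error_free_ratio: float       # accuracy proxy: share of error-free sentences


def caf_profile(text: str, flagged_sentences: set) -> CAFProfile:
    """Compute toy CAF indices; `flagged_sentences` holds indices of sentences
    that a rater (or an AWE tool) marked as containing at least one error."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return CAFProfile(0, 0.0, 0.0)
    word_counts = [len(s.split()) for s in sentences]
    total_words = sum(word_counts)
    error_free = sum(1 for i in range(len(sentences)) if i not in flagged_sentences)
    return CAFProfile(total_words,
                      total_words / len(sentences),
                      error_free / len(sentences))


if __name__ == "__main__":
    draft = "The results was surprising. Students revised their drafts carefully."
    # Sentence 0 contains a subject-verb agreement error, so it is flagged.
    print(caf_profile(draft, flagged_sentences={0}))
```

Even such simple, transparent measures, applied identically across conditions and studies, would make cross-study comparisons of AWE effects more tractable than holistic ratings alone.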


Concluding remarks

Drawing on the findings of this validation effort, we would like to reaffirm Bitchener's (2012) conclusion that WCF does have the potential to promote learning. Yet no validation enterprise can offer absolute answers, because validity is not about certainty but credibility, and validation does not seek to prove but to argue for plausibility (Kane, 2013). We argued that one of the plausible CF conditions can be automated writing evaluation. The caveat here is that AWE, like any other technology, can bring about variable outcomes depending on how it is exploited by users. It is possible that AWE use without complementary teacher feedback would not yield the same results found in our study; indeed, the isolated use of AWE without teacher feedback is highly discouraged. The complementary nature of AWE tools needs to be underlined in writing instruction (Grimes & Warschauer, 2010; Ware, 2011), and the way AWE is integrated into instruction likely determines its washback (Chen & Cheng, 2008; Li et al., 2015). There are a variety of ways in which AWE can be deployed, including using it as a mere text editor, a scaffold assisting teachers in instruction, or an interface fostering collaborative writing (Stevenson, 2016). Each manner potentially brings about different consequences and can thus offer different implications for AWE developers and users. Therefore, in the search for the most effective way to implement AWE, further empirical research on creative ways of implementing AWE, on ways of combining AWE and teacher feedback, and on the impact each approach might have on learning and teaching is invaluable.


Notes

  1. The literature recommends that the delayed test be conducted with a longer interval than the two weeks that has been typical practice in research thus far (Karim & Nassaji, 2018). Nevertheless, after that interval, the winter break, the Persian New Year holiday, and the few weeks needed for introducing and practicing journalistic writing (the course in which the delayed posttest was administered) did not provide us with an opportunity to administer the delayed test sooner than three months.

  2. In Wilson and Czik's (2016) coding scheme, sentence structure is described as "Pertaining to the correct expression of syntax" (e.g., don't start a sentence with a conjunction, fragments, or run-ons), whereas grammar is defined as "Pertaining to the correct usage of parts of speech at the word-, phrase-, or sentence-levels" (e.g., verb tense, pronoun usage and pronoun agreement, conjunctions (adding, deleting, substituting), or prepositions). We have thus separated the two in our research.


Disclosure statement

No potential conflict of interest was reported by the author(s).


Notes on contributors

Stephanie Link is an Assistant Professor of TESOL/Applied Linguistics at Oklahoma State University. Her research involves technology-mediated language learning with a focus on the development and use of automated writing evaluation tools and intelligent tutoring systems for second language writing and written scientific communication. Her work can be found in top journals, such as CALICO Journal, Language Learning and Technology, System, and Journal of Second Language Writing. Her recent co-edited volume through Equinox Publishing, Assessment Across Online Language Education, explores critical issues in the field.

Mohaddeseh Mehrzad is a Ph.D. candidate of TEFL at Shiraz University, Iran. Her research interests include computer-assisted language learning with a focus on technology-mediated second language writing, automated writing evaluation, and electronic written corrective feedback.

Mohammad Rahimi is an Associate Professor of TEFL at Shiraz University, Iran, and a former Assistant Professor at Université du Québec à Montréal, Canada. His research focuses on L2 Writing Evaluation, Written Corrective Feedback, and Dynamic Assessment of Writing. He has published in Language Teaching Research, System, Reading and Writing, Journal of Response to Writing, and Language and Sociocultural Theory.


ORCID

Stephanie Link http://orcid.org/0000-0002-5586-1495
Mohammad Rahimi http://orcid.org/0000-0002-1692-5492


References

Attali, Y. (2004). Exploring the feedback and revision features of Criterion. Paper presented at the National Council on Measurement in Education (NCME), San Diego, CA.

Bennett, R. E. (2011). Formative assessment: A critical review. Assessment in Education: Principles, Policy & Practice, 18, 5–25. doi:10.1080/0969594x.2010.513678

Biber, D., Nekrasova, T., & Horn, B. (2011). The effectiveness of feedback for L1-English and L2-writing development: A meta-analysis. ETS Research Report Series. Retrieved from https://www.ets.org/Media/Research/pdf/RR-11-05.pdf

Bitchener, J. (2012). A reflection on "the language learning potential" of written CF. Journal of Second Language Writing, 21(4), 348–363. doi:10.1016/j.jslw.2012.09.006

Chapelle, C. A., Cotos, E., & Lee, J. Y. (2015). Validity arguments for diagnostic assessment using automated writing evaluation. Language Testing, 32(3), 385–405. doi:10.1177/0265532214565386

Chen, C., & Cheng, W. (2008). Beyond the design of automated writing evaluation: Pedagogical practices and perceived learning effectiveness in EFL writing classes. Language Learning & Technology, 12, 94–112.

Chen, H., Xu, J., & He, B. (2014). Automated essay scoring by capturing relative writing quality. The Computer Journal, 57(9), 1318–1330. doi:10.1093/comjnl/bxt117

Conference on College Composition and Communication. (2009). Writing assessment: A position statement. Retrieved from http://www.ncte.org/cccc/resources/positions/ digitalenvironments

Cotos, E. (2012). Towards effective integration and positive impact of automated writing evaluation in L2 writing. In G. Kessler, A. Oskoz, & I. Elola (Eds.), Technology across writing contexts and tasks, CALICO Monograph Series (Vol. 10, pp. 81–112). San Marcos, TX: CALICO.

Cotos, E. (2014). Genre-based automated writing evaluation for L2 research writing: From design to evaluation and enhancement. New York, NY: Palgrave Macmillan.

Cotos, E. (2015). AWE for writing pedagogy: From healthy tension to tangible prospects. Special issue on Assessment for Writing and Pedagogy. Writing & Pedagogy, 7(2–3), 197–231. doi:10.1558/wap.v7i2-3.26381

Cotos, E., Link, S., & Huffman, S. (2017). Effects of DDL technology on genre learning [Special issue]. Language Learning & Technology, 21(3), 104–130. Retrieved from http://llt.msu.edu/issues/october2017/cotoslinkhuffman.pdf

Cunningham, K. J. (2019). Student perceptions and use of technology-mediated text and screencast feedback in ESL writing. Computers and Composition, 52, 222–241. doi:10.1016/j.compcom.2019.02.003

Dikli, S. (2010). The nature of automated essay scoring feedback. CALICO Journal, 28(1), 99–134. doi:10.11139/cj.28.1.99-134

Dikli, S., & Bleyle, S. (2014). Automated essay scoring feedback for second language writers: How does it compare to instructor feedback? Assessing Writing, 22, 1–17. doi:10.1016/j.asw.2014.03.006

Enright, M. K., & Quinlan, T. (2010). Complementing human judgment of essays written by English language learners with e-rater scoring. Language Testing, 27(3), 317–334. doi:10.1177/0265532210363144

ETS. (2019). About the Criterion® Online Writing Evaluation Service. Retrieved from https://www.ets.org/criterion/about/

Ferris, D. (1997). The influence of teacher commentary on student revision. TESOL Quarterly, 31(2), 315–339. doi:10.2307/3588049

Ferris, D. (2012). Written corrective feedback in second language acquisition and writing studies. Language Teaching, 45(4), 446–459. doi:10.1017/S0261444812000250

Ferris, D. R. (2010). Second language writing research and written corrective feedback in SLA: Intersections and practical applications. Studies in Second Language Acquisition, 32(2), 181–201. doi:10.1017/S0272263109990490

Ferris, D. R. (2011). Treatment of error in second language student writing (2nd ed.). Ann Arbor, MI: University of Michigan Press.

Ferris, D. R., & Hedgcock, J. S. (2005). Teaching ESL composition: Purpose, process, and practice. Mahwah, NJ: Erlbaum.

Ferris, D. R., & Roberts, B. (2001). Error feedback in L2 writing classes: How explicit does it need to be? Journal of Second Language Writing, 10(3), 161–184. doi:10.1016/S1060-3743(01)00039-X

Grimes, D., & Warschauer, M. (2010). Utility in a fallible tool: A multi-site case study of automated writing evaluation. The Journal of Technology, Language, and Assessment, 8(6), 1–43. Retrieved from http://ejournals.bc.edu/ojs/index.php/jtla/article/download/1625/1469

Housen, A., Kuiken, F., & Vedder, I. (Eds.), (2012). Dimensions of L2 performance and proficiency: Complexity, accuracy and fluency in SLA. Amsterdam: John Benjamins.

Hulstijn, J. H., & Marchena, E. (1989). Avoidance: Grammatical or semantic causes? Studies in Second Language Acquisition, 11(3), 241–255. doi:10.1017/S0272263100008123

Hyland, K., & Hyland, F. (2019). Feedback in second language writing: Contexts and issues. New York, NY: Cambridge University Press.

Johnson, M. D. (2017). Cognitive task complexity and L2 written syntactic complexity, accuracy, lexical complexity, and fluency: A research synthesis and meta-analysis. Journal of Second Language Writing, 37, 13–38. doi:10.1016/j.jslw.2017.06.001

Johnson, M. D., Mercado, L., & Acevedo, A. (2012). The effect of planning sub-processes on L2 writing fluency, grammatical complexity, and lexical complexity. Journal of Second Language Writing, 21(3), 264–282. doi:10.1016/j.jslw.2012.05.011

Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. doi:10.1111/jedm.12000

Kang, U., & Han, Z. (2015). The efficacy of written corrective feedback in improving L2 written accuracy: A meta-analysis. The Modern Language Journal, 99(1), 1–18. doi:10.1111/modl.12189

Karim, K., & Nassaji, H. (2018). The revision and transfer effects of direct and indirect comprehensive corrective feedback on ESL students' writing. Language Teaching Research. doi:10.1177/1362168818802469

Leacock, C., Chodorow, M., Gamon, M., & Tetreault, J. (2010). Automated grammatical error detection for language learners. Synthesis Lectures on Human Language Technologies, 3(1), 1–134. doi:10.2200/S00275ED1V01Y201006HLT009

Lee, I. (2011). Working smarter, not working harder: Re-visiting teacher feedback in the L2 writing classroom. Canadian Modern Language Review, 67(3), 377–399. doi:10.3138/cmlr.67.3

Lee, I. (2019). Teachers' frequently asked questions about focused written corrective feedback. TESOL Journal, 10(3), 1–15. doi:10.1002/tesj.427

Lee, J., & Hegelheimer, V. (2012, August). A hybrid use of Criterion® and teacher feedback in process writing. Paper presented at the EUROCALL conference, Gothenburg, Sweden.

Li, Z., Feng, H. H., & Saricaoglu, A. (2017). The short-term and long-term effects of AWE feedback on ESL students' development of grammatical accuracy. CALICO Journal, 34(3), 355–375. doi:10.1558/cj.26382

Li, J., Link, S., & Hegelheimer, V. (2015). Rethinking the role of automated writing evaluation (AWE) feedback in ESL writing instruction. Journal of Second Language Writing, 27, 1–18. doi:10.1016/j.jslw.2014.10.004

Link, S., Dursun, A., Karakaya, K., & Hegelheimer, V. (2014). Towards best ESL practices for implementing automated writing evaluation. CALICO Journal, 31(3), 323–344. doi:10.11139/cj.31.3.323-344

Liu, Q., & Brown, D. (2015). Methodological synthesis of research on the effectiveness of corrective feedback in L2 writing. Journal of Second Language Writing, 30, 66–81. doi:10.1016/j.jslw.2015.08.011

Lu, X. (2017). Automated measurement of syntactic complexity in corpus-based L2 writing research and implications for writing assessment. Language Testing, 34, 493–511. doi:10.1177/0265532217710675

Lu, X. (2011). A corpus-based evaluation of syntactic complexity measures as indices of college-level ESL writers' language development. TESOL Quarterly, 45, 36–62. doi:10.5054/tq.2011.240859

Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics, 15(4), 474–496. doi:10.1075/ijcl.15.4.02lu

Manchón, R. M. (2011). Situating the learning-to-write and writing-to-learn dimensions of L2 writing. In R. M. Manchón (Ed.), Learning-to-write and writing-to-learn in an additional language (pp. 3–14). Philadelphia, PA: John Benjamins Publishing Company.

McNamara, D. S., Graesser, A. C., McCarthy, P., & Cai, Z. (2014). Automated evaluation of text and discourse with Coh-Metrix. Cambridge, UK: Cambridge University Press.

Norris, J. M., & Ortega, L. (2009). Towards an organic approach to investigating CAF in instructed SLA: The case of complexity. Applied Linguistics, 30(4), 555–578. doi:10.1093/applin/amp044

Ortega, L. (2015). Syntactic complexity in L2 writing: Progress and expansion. Journal of Second Language Writing, 29, 82–94. doi:10.1016/j.jslw.2015.06.008

Plakans, L., Gebril, A., & Bilki, Z. (2019). Shaping a score: Complexity, accuracy, and fluency in integrated writing performances. Language Testing, 36(2), 161–179. doi:10.1177/0265532216669537

Plonsky, L., & Oswald, F. L. (2014). How big is big? Interpreting effect sizes in L2 research. Language Learning, 64(4), 878–912. doi:10.1111/lang.12079

Polio, C. (1997). Measures of linguistic accuracy in second language writing research. Language Learning, 47(1), 101–143. doi:10.1111/0023-8333.31997003

Polio, C. (2001). Research methodology in second language writing: The case of text-based studies. In T. Silva & P. Matsuda (Eds.), On second language writing (pp. 91–116). Mahwah, NJ: Lawrence Erlbaum Associates.

Polio, C., & Shea, M. C. (2014). An investigation into current measures of linguistic accuracy in second language writing research. Journal of Second Language Writing, 26, 10–27. doi:10.1016/j.jslw.2014.09.003

Quick Placement Test. (2001). Oxford: Oxford University Press.

Rahimi, M. (2019). A comparative study of the impact of focused vs. comprehensive corrective feedback and revision on ESL learners' writing accuracy and quality. Language Teaching Research. doi:10.1177/1362168819879182

Ranalli, J. (2018). Automated written corrective feedback: How well can students make use of it? Computer Assisted Language Learning, 31(7), 653–674. doi:10.1080/09588221.2018.1428994

Stevenson, M. (2016). A critical interpretative synthesis: The integration of Automated Writing Evaluation into classroom writing instruction. Computers and Composition, 42, 1–16. doi:10.1016/j.compcom.2016.05.001

Stevenson, M., & Phakiti, A. (2014). The effects of computer generated feedback on the quality of writing. Assessing Writing, 19, 51–65. doi:10.1016/j.asw.2013.11.007

Truscott, J. (1996). The case against grammar correction in L2 writing classes. Language Learning, 46(2), 327–369. doi:10.1111/j.1467-1770.1996.tb01238.x

Van Beuningen, C. (2010). Corrective feedback in L2 writing: Theoretical perspectives, empirical insights, and future directions. International Journal of English Studies, 10(2), 1–27. doi:10.6018/ijes/2010/2/119171

Verspoor, M., Schmid, M. S., & Xu, X. (2012). A dynamic usage based perspective on L2 writing. Journal of Second Language Writing, 21(3), 239–263. doi:10.1016/j.jslw.2012.03.007

Victori, M. (1995). EFL writing knowledge and strategies: An interactive study (Unpublished PhD dissertation). Universitat Autònoma de Barcelona, Barcelona, Spain.

Wang, P.-L. (2015). Effects of an automated writing evaluation program: Student experiences and perceptions. Electronic Journal of Foreign Language Teaching, 12(1), 79–100.

Wang, Y. J., Shang, H. F., & Briody, P. (2013). Exploring the impact of using automated writing evaluation in English as a foreign language university students' writing. Computer Assisted Language Learning, 26(3), 234–257. doi:10.1080/09588221.2012.655300

Ware, P. (2011). Computer-generated feedback on student writing. TESOL Quarterly, 45(4), 769–774. doi:10.5054/tq.2011.272525

Ware, P. (2014). Feedback for adolescent writers in the English classroom: Exploring pen-and-paper, electronic, and automated options. Writing & Pedagogy, 6(2), 223–249. doi:10.1558/wap.v6i2.223

Warschauer, M. (2010). Invited commentary: New tools for teaching writing. Language Learning & Technology, 14(1), 3–8.

Warschauer, M., & Grimes, D. (2008). Automated writing assessment in the classroom. Pedagogies: An International Journal, 3(1), 22–36. doi:10.1080/15544800701771580

Warschauer, M., & Ware, P. (2006). Automated writing evaluation: Defining the classroom research agenda. Language Teaching Research, 10(2), 157–180. doi:10.1191/1362168806lr190oa

Weigle, S. C. (2013). English language learners and automated scoring of essays: Critical considerations. Assessing Writing, 18(1), 85–99. doi:10.1016/j.asw.2012.10.006

Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13. doi:10.1111/j.1745-3992.2011.00223.x

Wilson, J., & Czik, A. (2016). Automated essay evaluation software in English Language Arts classrooms: Effects on teacher feedback, student motivation, and writing quality. Computers & Education, 100, 94–109. doi:10.1016/j.compedu.2016.05.004

Xi, X. (2010). Automated scoring and feedback systems. Language Testing, 27(3), 291–300. doi:10.1177/0265532210364643

Yoon, H.-J., & Polio, C. (2016). The linguistic development of students of English as a second language in two written genres. TESOL Quarterly, 51, 275–301. doi:10.1002/tesq.296